ggplot2
ggplot2 is an R package for creating elegant and
flexible data visualizations. It follows the Grammar of Graphics
principles, which allow users to build complex plots incrementally by
adding layers.
Data: The dataset you are working with.
Aesthetics (aes): The visual properties of the plot (e.g., x, y, color, size, shape).
Geometries (geoms): The type of plot you want (e.g., scatter, bar, histogram).
Faceting: Splitting the data into multiple plots.
Themes and Labels: Customizing the appearance of the plot.
A ggplot visualization always follows this basic
structure:
p <- ggplot(data, aes(x, y)) + geom_someplot()
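As a rough illustration of how these components combine, the sketch below builds one plot layer by layer. The `mpg` dataset (shipped with ggplot2) and the variables chosen here are assumptions made only for this example; any data frame with two numeric columns and a grouping column would do.

```r
library(ggplot2)

# Data + aesthetics: which data frame, and which columns map to x, y, and color
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +                           # Geometry: one point per row
  facet_wrap(~ drv) +                      # Faceting: one panel per drive type
  labs(title = "Engine size vs. highway mileage",
       x = "Engine displacement (L)",
       y = "Highway miles per gallon") +   # Labels
  theme_minimal()                          # Theme

p  # printing the object draws the plot
```

Because each component is added with `+`, any line after the first can be removed or swapped without touching the others.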
A histogram is a type of data visualization that represents the distribution of a continuous numerical variable. It consists of bins (intervals) along the x-axis and bars whose heights represent the frequency (count) of data points within each bin.
Key Features of a Histogram:
* The x-axis represents the range of values for the variable.
* The y-axis represents the frequency (or density) of observations within each bin.
* Unlike bar charts, which display discrete categories, histograms are used for continuous data.
* The number of bins affects the level of detail: too few bins can obscure patterns, while too many bins may create excessive noise.
library(ggplot2)
library(dplyr) # provides the starwars dataset
# Basic histogram in ggplot
ggplot(starwars, aes(x = mass)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Setting the number of bins to different values
ggplot(starwars, aes(x = mass)) +
geom_histogram(bins = 5)
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(starwars, aes(x = mass)) +
geom_histogram(bins = 20)
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(starwars, aes(x = mass)) +
geom_histogram(bins = 30)
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(starwars, aes(x = mass)) +
geom_histogram(bins = 50)
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
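The message printed by the first histogram suggests choosing a `binwidth` instead of a number of bins. A minimal sketch of that alternative follows; the width of 25 and the `boundary = 0` alignment are arbitrary choices for illustration, not recommended values.

```r
library(ggplot2)
library(dplyr)  # provides the starwars dataset

# Drop missing masses up front so ggplot2 has no rows to remove
sw_mass <- filter(starwars, !is.na(mass))

# binwidth is expressed in the units of the x variable (here, kilograms);
# boundary = 0 makes the bin edges fall on multiples of the bin width
p_binwidth <- ggplot(sw_mass, aes(x = mass)) +
  geom_histogram(binwidth = 25, boundary = 0)

p_binwidth
```

Unlike `bins`, a fixed `binwidth` keeps the bars comparable when the range of the data changes.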
We can use the fill argument to either add color:
ggplot(starwars, aes(x = mass)) +
geom_histogram(bins = 50, fill = "blue") # the fill here needs to go in the geom
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
or to differentiate between the levels of a qualitative variable, drawing one histogram per level in the same panel:
ggplot(starwars, aes(x = mass, fill = gender)) + # the fill here needs to go in the `aes` function
geom_histogram(bins = 50)
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
In this case, the feminine and masculine histograms
are hard to tell apart. We can make the bars more transparent with the
alpha argument.
ggplot(starwars, aes(x = mass, fill = gender)) + # the fill here needs to go in the `aes` function
geom_histogram(bins = 50, alpha = 0.3) # a constant alpha goes in the geom, not in `aes`
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_bin()`).
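But we still can’t see through the histograms: by default, geom_histogram() stacks the groups on top of each other rather than overlaying them, so transparency alone does not separate them.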
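One way to get histograms that truly overlap is to turn off stacking with `position = "identity"`. This is a sketch of that option, not the only approach; the alpha value is an arbitrary choice.

```r
library(ggplot2)
library(dplyr)  # provides the starwars dataset

# position = "identity" draws every group from the baseline,
# so the semi-transparent bars overlap instead of stacking
p_overlaid <- ggplot(starwars, aes(x = mass, fill = gender)) +
  geom_histogram(bins = 50, alpha = 0.3, position = "identity")

p_overlaid
```

With overlaid bars, the height of each color can be read directly from the y-axis, which is not the case for the upper segments of a stacked histogram.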
A density plot is a smoothed version of a histogram that estimates the probability density function (PDF) of a continuous variable. Instead of using discrete bins like a histogram, a density plot uses a kernel density estimation (KDE) technique to create a continuous curve that represents the distribution of data.
Key Features of a Density Plot:
* The x-axis represents the values of the variable.
* The y-axis represents the estimated density (not frequency counts, as in histograms).
* The total area under the curve equals 1, making it useful for comparing distributions.
* It is a smoother alternative to histograms, better suited for identifying multiple peaks in the data.
ggplot(starwars, aes(x = mass)) +
geom_density(fill = "blue", alpha = 0.4) # here the alpha makes the area under the curve more transparent
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_density()`).
Here we can see through each plot:
ggplot(starwars, aes(x = mass, fill = gender)) +
geom_density(alpha = 0.4) # here the alpha makes the area under each curve more transparent
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_density()`).
library(ggridges)
ggplot(starwars, aes(x = mass, y = gender, fill = gender)) +
geom_density_ridges(alpha = 0.6)
## Picking joint bandwidth of 7.81
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).
A bar plot (or bar chart) is a type of visualization that represents categorical data using rectangular bars. The height of each bar corresponds to the count or value of the category it represents. Bar plots are useful for comparing frequencies, proportions, or summary statistics across categories.
Key Features of a Bar Plot:
* The x-axis represents categories (e.g., gender, species, or group labels).
* The y-axis represents values (e.g., counts, means, or sums).
* Bars can be grouped or stacked to compare subcategories.
* Bars can display counts (frequency) or summary statistics (e.g., mean values).
ggplot(starwars, aes(x = gender)) +
geom_bar(fill = "blue", alpha = 0.6)
Example: Bar Plot with Summary Statistics
If you want to plot the average height per species, you can use
stat = "summary" to calculate the mean:
ggplot(starwars, aes(x = species, y = height)) +
geom_bar(stat = "summary", fun = "mean", fill = "darkgreen") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate labels for readability
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_summary()`).
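An alternative to `stat = "summary"` is to compute the means first and then draw them with `geom_col()`, which plots the values it is given. The sketch below assumes we also want to drop characters with a missing species, which the example above does not do.

```r
library(ggplot2)
library(dplyr)  # provides the starwars dataset

# Compute the mean height per species explicitly
mean_height <- starwars %>%
  filter(!is.na(species), !is.na(height)) %>%   # assumption: ignore missing species/height
  group_by(species) %>%
  summarise(mean_height = mean(height), n = n(), .groups = "drop")

# geom_col() uses the y values as given (no counting or summarising)
p_means <- ggplot(mean_height, aes(x = species, y = mean_height)) +
  geom_col(fill = "darkgreen") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_means
```

Keeping the summary table around also makes it easy to see how many characters each mean is based on (the `n` column).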
A stacked bar chart is a variation of the standard bar plot where bars are divided into segments representing different subcategories. Each bar still represents a primary categorical variable, but within each bar, different colors show the proportions of subcategories.
Key Features of a Stacked Bar Chart:
* The x-axis represents the main categorical variable.
* The y-axis represents counts or a summary statistic.
* Each bar is divided into subcategories, stacked on top of each other.
* Colors differentiate the subcategories.
Example: Stacked Bar Chart (Counts)
This example shows how gender is distributed within each species in the Star Wars dataset.
ggplot(starwars, aes(x = species, fill = gender)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
The fill = gender argument ensures that each bar is
divided by gender.
The stacked segments show how many characters of each gender exist within each species.
Example: Proportional (100%) Stacked Bar Chart
To visualize proportions instead of absolute counts, use
position = "fill", which makes each bar equal in height
(100%) but displays the relative distribution of subcategories.
ggplot(starwars, aes(x = species, fill = gender)) +
geom_bar(position = "fill") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
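With `position = "fill"` the y-axis runs from 0 to 1. If percentages read better, one option is to relabel the axis with `label_percent()` from the scales package, as in this sketch; the axis title is an assumption for the example.

```r
library(ggplot2)
library(dplyr)   # provides the starwars dataset
library(scales)  # provides label_percent()

p_prop <- ggplot(starwars, aes(x = species, fill = gender)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = label_percent()) +  # show 0.25 as 25%
  labs(y = "Share of characters") +               # hypothetical axis title
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_prop
```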
A box plot (also called a box-and-whisker plot) is a statistical visualization that summarizes the distribution of a dataset by displaying key summary statistics. It is particularly useful for comparing distributions across multiple groups and identifying outliers.
Key Components of a Box Plot:
Example: Basic Box Plot
This example visualizes the distribution of character height in the Star Wars dataset.
ggplot(starwars, aes(x = "", y = height)) +
geom_boxplot(fill = "lightblue", color = "black") +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The box represents the IQR (middle 50% of the data).
The whiskers extend to the smallest and largest values that lie within 1.5 × IQR of the box edges (the first and third quartiles).
Any points beyond the whiskers are outliers, shown as dots.
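To make these rules concrete, the quantities behind the box plot can be computed by hand. This is a sketch using base R's default quantile method; the variable names are ours, chosen for the example.

```r
library(dplyr)  # provides the starwars dataset

# Heights with missing values removed
h <- starwars$height[!is.na(starwars$height)]

q1  <- quantile(h, 0.25, names = FALSE)  # lower edge of the box
q3  <- quantile(h, 0.75, names = FALSE)  # upper edge of the box
iqr <- q3 - q1                           # height of the box

# Values outside these fences are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(h[h >= lower_fence])
whisker_high <- max(h[h <= upper_fence])

outliers <- h[h < lower_fence | h > upper_fence]
outliers
```

Note that the whiskers end at actual data values, so they are usually shorter than the fences themselves.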
Example: Grouped Box Plot
To compare distributions across categories, such as height by gender:
ggplot(starwars, aes(x = gender, y = height, fill = gender)) +
geom_boxplot() +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The x-axis represents different categories.
Each box shows the distribution of height within a gender.
Outliers (if present) appear as dots outside the whiskers.
To show individual data points alongside the box plot, we add
geom_jitter():
ggplot(starwars, aes(x = gender, y = height, fill = gender)) +
geom_boxplot(alpha = 0.5) +
geom_jitter(color = "black", width = 0.2, alpha = 0.5) +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
geom_jitter() spreads points horizontally to avoid
overlap.
This helps visualize density and distribution of observations.
A violin plot is a data visualization that combines aspects of a box plot and a density plot, providing insights into the distribution and probability density of a dataset. It is particularly useful for visualizing variations across different categories while retaining information about the distribution shape.
Key Features of a Violin Plot
ggplot(starwars, aes(x = "", y = height)) +
geom_violin(fill = "lightblue", color = "black") +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
To compare height distributions by gender:
ggplot(starwars, aes(x = gender, y = height, fill = gender)) +
geom_violin() +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
The shape of each violin reveals how height varies within each gender.
A wider section means more characters fall in that height range.
A box plot inside a violin plot provides both a summary (box plot) and detailed distribution (violin).
ggplot(starwars, aes(x = gender, y = height, fill = gender)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.1, fill = "white") +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Adding Jittered Points
To see raw data points, we use geom_jitter():
ggplot(starwars, aes(x = gender, y = height, fill = gender)) +
geom_violin(alpha = 0.5) +
geom_jitter(color = "black", width = 0.2, alpha = 0.5) +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
Jittered points avoid overlap, making individual data points visible.
When to Use Violin Plots:
✔ When comparing distributions across categories.
✔ When the underlying shape of the data is important.
✔ When you want to show both summary statistics and density.
Not ideal for small sample sizes, as the density estimation may be misleading.
A mosaic plot is a graphical representation of contingency tables, showing the relationship between two or more categorical variables. It is an extension of a stacked bar chart, where both area and proportion convey information about the frequency distribution.
Key Features of a Mosaic Plot
* Displays contingency tables in a visual format.
* The size of each rectangle represents the proportion of observations in that category.
* Both row and column proportions are visually represented.
* Ideal for categorical data and analyzing relationships between categories.
# Load the required libraries
library(ggplot2)
library(ggmosaic)
Example 1: Creating a Basic Mosaic Plot
We’ll use the built-in Titanic dataset, which contains survival data
categorized by class, sex, and age.
# Convert Titanic dataset into a data frame
titanic_df <- as.data.frame(Titanic)
# Basic mosaic plot: Passenger Class vs. Survival
ggplot(data = titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Class), fill = Survived)) +
theme_minimal()
## Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2
## 3.5.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Example 2: Adding More Categories
We can extend the mosaic plot to include both passenger class and sex to show more relationships.
ggplot(data = titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Class, Sex), fill = Survived)) +
theme_minimal()
Understanding the Mosaic Plot
When to Use a Mosaic Plot
✔ When visualizing relationships between two or more categorical variables.
✔ When looking at proportional relationships rather than raw counts.
✔ When an alternative to stacked bar charts is needed for clearer interpretation.
Not ideal for continuous data, as it only supports categorical variables.
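The numbers that a mosaic plot of class by survival encodes can also be inspected directly as a contingency table. The sketch below uses only base R; the object names are ours.

```r
# Same conversion as in the mosaic examples
titanic_df <- as.data.frame(Titanic)

# Contingency table of counts: passenger class by survival
class_by_survival <- xtabs(Freq ~ Class + Survived, data = titanic_df)

# Share of all people in each class (corresponds to the width of each column of tiles)
class_share <- prop.table(margin.table(class_by_survival, margin = 1))

# Survival proportions within each class (corresponds to the split inside each column)
row_props <- prop.table(class_by_survival, margin = 1)

round(class_share, 2)
round(row_props, 2)
```

Reading the table next to the plot is a quick way to check that the tile sizes are being interpreted correctly.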
A scatter plot is a visualization that displays the relationship between two numerical variables. Each point on the plot represents an observation, with the x-axis representing one variable and the y-axis representing another. Scatter plots are useful for identifying correlations, clusters, outliers, and trends in data.
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.3.3
We start with a simple scatter plot showing the relationship between flipper length and body mass.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() +
labs(title = "Basic Scatter Plot: Flipper Length vs. Body Mass",
x = "Flipper Length (mm)",
y = "Body Mass (g)")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
To make the visualization more insightful, we color the points by species.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point(size = 3) +
labs(title = "Scatter Plot with Color: Species Differentiation",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Interpretation: Now we can see differences among species based on their body size and flipper length.
We can add different shapes to differentiate sex and adjust transparency (alpha) to deal with overlapping points.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = sex)) +
geom_jitter(size = 3, alpha = 0.7, width = 2, height = 100) + # Add jitter with slight adjustments
labs(title = "Scatter Plot with Jitter to Reduce Overlapping Points",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now let’s add a regression line to visualize the trend in the relationship.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = sex), size = 3) + # Color and shape only for points
geom_smooth(method = "lm", se = FALSE, color = "black") + # One regression line for all data
labs(title = "Scatter Plot with a Single Regression Line",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
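The straight line drawn by `geom_smooth(method = "lm")` corresponds to an ordinary least-squares fit of `y ~ x`. As a sketch, the same model can be fitted directly with `lm()` to read off its intercept and slope; the object name `fit` is ours.

```r
library(palmerpenguins)

# Fit body mass as a linear function of flipper length;
# lm() drops rows with missing values by default
fit <- lm(body_mass_g ~ flipper_length_mm, data = palmerpenguins::penguins)

coef(fit)                # intercept and slope of the line
summary(fit)$r.squared   # share of variance in body mass explained by flipper length
```

The slope gives the expected change in body mass (in grams) for each additional millimetre of flipper length.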
We can also scale the size of the points by another quantitative variable:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = sex, size = bill_depth_mm)) + # Point size scales with bill depth
geom_smooth(method = "lm", se = FALSE, color = "black") + # One regression line for all data
labs(title = "Scatter Plot with a Single Regression Line",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
Instead of a linear trend, we can use a LOESS (locally estimated scatterplot smoothing) curve, which can follow non-linear relationships that a straight line would miss.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = sex, size = bill_depth_mm)) +
geom_smooth(method = "loess", se = TRUE, color = "black") + # One LOESS curve for all data
labs(title = "Scatter Plot with a LOESS Curve",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
To further explore species differences, we facet the scatter plot into separate plots for each species.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species), size = 3) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ species) +
labs(title = "Faceted Scatter Plots by Species",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
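library(tidyverse) # read_csv(), %>%, select(), pivot_wider(), and left_join() come from the tidyverse
setwd("~/Library/Mobile Documents/com~apple~CloudDocs/Documents/R/R projects/raw2refined")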
## In the csv file ".." corresponds to missing data, so we use the 'na' argument to tell R that .. is missing data
wb_data <- read_csv("data/c252e189-29d1-48ec-b29e-dbf778bb67d2_Data.csv", na = "..")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 873 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Country Name, Country Code, Series Name, Series Code
## dbl (1): 2020 [YR2020]
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Next, we want to remove the rows where there is missing data
## The na.omit function removes every row that contains at least one NA (R's marker for missing data)
wb_data <- na.omit(wb_data)
# Let's select only the columns that we need
wb_data_vars <- wb_data %>% # %>% is the pipe operator (from magrittr, re-exported by dplyr); it passes the result on its left to the function on its right, so several steps can be chained
select(`Country Name`, `Country Code`, `Series Name`, `2020 [YR2020]`)
# These data are in long format and we want to transform to wide format
## We use pivot_wider to transform to wide format
## We tell it the columns that identify the unique observations, here country
## We tell it the column that gives the names of the variables
## We tell it the column that gives us the values that go along with variables
wb_data_vars_wide <- pivot_wider(wb_data_vars, id_cols = c(`Country Name`, `Country Code`),
names_from = `Series Name`,
values_from = `2020 [YR2020]`
)
## Now we have wb_data_vars_wide
# We still have some missing values
## Again, we remove any row that has at least one missing value
wb_data_vars_wide_noNAs <- na.omit(wb_data_vars_wide)
## Variable names are long and annoying
wb_data_clean <- wb_data_vars_wide_noNAs %>%
mutate(
countryName = `Country Name`,
countryCode = `Country Code`,
gdpCurrentUSD = `GDP (current US$)`,
accessToElectricity = `Access to electricity (% of population)`,
malariaIncidence = `Incidence of malaria (per 1,000 population at risk)`,
schoolEnrollmentGPI = `School enrollment, primary (gross), gender parity index (GPI)`
) %>%
select(-`Country Name`, -`Country Code`, -`GDP (current US$)`,
-`Access to electricity (% of population)`,
-`Incidence of malaria (per 1,000 population at risk)`,
-`School enrollment, primary (gross), gender parity index (GPI)`) # Remove old names
# Check updated column names
names(wb_data_clean)
## [1] "countryName" "countryCode" "gdpCurrentUSD"
## [4] "accessToElectricity" "malariaIncidence" "schoolEnrollmentGPI"
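The reshaping and renaming steps above can be tried on a small made-up table. In the sketch below, the two countries ("AAA", "BBB") and their values are invented for illustration, and `rename()` is shown as a shorter alternative to creating new columns with `mutate()` and then dropping the old ones.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Long format: one row per country and indicator (values are made up)
long_toy <- tibble(
  `Country Code` = c("AAA", "AAA", "BBB", "BBB"),
  `Series Name`  = c("GDP (current US$)", "Access to electricity (% of population)",
                     "GDP (current US$)", "Access to electricity (% of population)"),
  value          = c(100, 90, 200, 75)
)

# Wide format: one row per country, one column per indicator
wide_toy <- pivot_wider(long_toy,
                        id_cols     = `Country Code`,
                        names_from  = `Series Name`,
                        values_from = value)

# rename(new = old) changes the names in place and keeps the column order
clean_toy <- wide_toy %>%
  rename(countryCode         = `Country Code`,
         gdpCurrentUSD       = `GDP (current US$)`,
         accessToElectricity = `Access to electricity (% of population)`)

names(clean_toy)
```

With `rename()`, columns that are not mentioned are left untouched, so there is no separate list of old names to remove.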
# Read country metadata
url <- "https://raw.githubusercontent.com/zhgarfield/raw2refined/main/data/country_metadata.csv"
country_metadata <- read.csv(url, stringsAsFactors = FALSE)
head(country_metadata)
## name alpha.2 alpha.3 country.code iso_3166.2 region
## 1 Afghanistan AF AFG 4 ISO 3166-2:AF Asia
## 2 Åland Islands AX ALA 248 ISO 3166-2:AX Europe
## 3 Albania AL ALB 8 ISO 3166-2:AL Europe
## 4 Algeria DZ DZA 12 ISO 3166-2:DZ Africa
## 5 American Samoa AS ASM 16 ISO 3166-2:AS Oceania
## 6 Andorra AD AND 20 ISO 3166-2:AD Europe
## sub.region intermediate.region region.code sub.region.code
## 1 Southern Asia 142 34
## 2 Northern Europe 150 154
## 3 Southern Europe 150 39
## 4 Northern Africa 2 15
## 5 Polynesia 9 61
## 6 Southern Europe 150 39
## intermediate.region.code
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
# Create new ID to match wb_data
country_metadata$countryCode <- country_metadata$alpha.3
# Merge data
wb_data_clean <- left_join(wb_data_clean, country_metadata, by = "countryCode")
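A `left_join()` keeps every row of the left table and fills the new columns with NA where no match is found, so it can be worth checking which codes failed to match. The sketch below uses two invented tables (the codes "AAA", "BBB", "ZZZ" and the regions are made up) and `anti_join()` to list the unmatched rows.

```r
library(dplyr)

# Made-up indicator table and metadata table sharing a countryCode key
indicators <- tibble(countryCode   = c("AAA", "BBB", "ZZZ"),
                     gdpCurrentUSD = c(100, 200, 300))
metadata   <- tibble(countryCode = c("AAA", "BBB"),
                     region      = c("North", "South"))

# Every row of indicators is kept; ZZZ gets region = NA
joined <- left_join(indicators, metadata, by = "countryCode")

# Rows of indicators with no matching code in metadata
unmatched <- anti_join(indicators, metadata, by = "countryCode")

joined
unmatched
```

Applied to the real data, the same check would show any rows whose code is missing from the country metadata; whether such rows exist in this file is not something the sketch assumes.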
A nice ggplot:
library(ggplot2)
library(dplyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Clean dataset: Removing NAs and ensuring meaningful malaria incidence values
wb_data_plot <- wb_data_clean %>%
filter(!is.na(gdpCurrentUSD), !is.na(accessToElectricity), !is.na(malariaIncidence)) %>%
mutate(
log_gdp = log10(gdpCurrentUSD),
malariaScaled = sqrt(malariaIncidence) # Scaling malaria incidence for better visualization
)
# Plot
ggplot(wb_data_plot, aes(x = log_gdp, y = accessToElectricity, color = region)) +
geom_point(aes(size = malariaScaled), alpha = 0.7) + # Scatter points with transparency, sized by malaria incidence
geom_smooth(method = "loess", se = FALSE, color = "black", linetype = "dashed", linewidth = 1) + # LOESS trend line; `linewidth` replaces the deprecated `size` for lines
scale_size_continuous(range = c(2, 10), name = "Malaria Incidence (scaled)") + # Adjust point sizes
#scale_x_continuous() + # Improve readability of GDP values
labs(
title = "GDP, Electricity Access, and Malaria Incidence",
subtitle = "Higher GDP correlates with electricity access, but malaria incidence varies",
x = "Log GDP (Current USD)",
y = "Access to Electricity (%)",
caption = "Data Source: World Bank"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5, size = 18),
plot.subtitle = element_text(hjust = 0.5, size = 14)
)
## `geom_smooth()` using formula = 'y ~ x'